Abstract

Background: Protein-protein interaction (PPI) extraction has been a focal point of many biomedical research and
database curation tools. Both Active Learning (AL) and Semi-supervised Learning (SSL) with SVMs have recently been
applied to extract PPIs automatically. In this paper, we explore combining AL with SSL to improve the performance
of the PPI extraction task.

Methods: We propose a novel PPI extraction technique called PPISpotter by combining Deterministic Annealing- 
based SSL and an AL technique to extract protein-protein interaction. In addition, we extract a comprehensive set 
of features from MEDLINE records by Natural Language Processing (NLP) techniques, which further improve the 
SVM classifiers. In our feature selection technique, syntactic, semantic, and lexical properties of text are 
incorporated into feature selection that boosts the system performance significantly. 

Results: By conducting experiments with three different PPI corpora, we show that PPISpotter is superior to the
other techniques incorporated into semi-supervised SVMs, such as Random Sampling, Clustering, and Transductive
SVMs, in terms of precision, recall, and F-measure.

Conclusions: Our system is a novel, state-of-the-art technique for efficiently extracting protein-protein interaction 
pairs. 



Background 

Automated protein-protein interaction (PPI) extraction 
from unstructured text collections is a task of significant 
interest in the bio-literature mining field. The most 
commonly addressed problem has been the extraction 
of binary interactions, where the system identifies which 
protein pairs in a sentence have a biologically relevant 
relationship between them. Proposed solutions include 
both hand-crafted rule-based systems and machine 
learning approaches [1]. Recently Semi-supervised 
Learning (SSL) techniques have been applied to PPI 
tasks [2]. SSL is a Machine Learning (ML) approach 
that combines supervised and unsupervised learning
where typically a small amount of labeled and a large 
amount of unlabeled data are used for training. SSL has
gained significant attention in PPI extraction for
two reasons. First, labeling a large set of instances is
labor-intensive and time-consuming. This task must
also be carried out by qualified experts and is thus
expensive. Second, several studies show that using unlabeled
data for learning improves the accuracy of classifiers 
[3,4]. 

One major problem of SSL is that it may introduce 
incorrect labels to the training data, as the labeling is 
done by machine, and such labeling errors are critical to 
the classification performance. Active Learning (AL) can 
complement the SSL by reducing such labeling errors 
[5]. AL is a technique of selecting a small sample from 
the unlabeled data such that labeling on the sample 
maximizes the learning accuracy. The selected sample is
manually labeled by experts. In this paper, we explore
combining AL with SSL to improve the performance
of the PPI task. To the best of our knowledge, this is
the first attempt to apply a combination of semi-supervised
and active learning to the extraction of protein-protein
interaction.

The contributions of this paper are threefold. First,
we propose a novel PPI extraction technique called
PPISpotter by combining Deterministic Annealing-based
SSL and an AL technique to extract protein-protein
interaction. Second, we extract a comprehensive set of
features from MEDLINE records by Natural Language
Processing (NLP) techniques, which further improve the
SVM classifiers. Our feature selection technique incorporates
syntactic, semantic, and lexical properties of text, which
boosts the system performance significantly. Third, we
conduct experiments with three different PPI corpora and show
that PPISpotter is superior to other techniques in terms of
precision, recall, and F-measure.

Many approaches have been proposed to extract pro- 
tein-protein interaction from unstructured text. One 
approach employs pre-specified patterns and rules for 
PPI extraction [6]. However, this approach is often inap- 
plicable to complex cases not covered by the pre-defined 
patterns and rules. Huang et al. [7] proposed a method 
where patterns are discovered automatically from a set 
of sentences by dynamic programming. 

The second approach utilizes a dictionary. Blaschke et
al. [8] extracted protein-protein interactions based on
co-occurrence of the form "... p1 ... I1 ... p2 ..." within a
sentence, where p1 and p2 are proteins and I1 is an interaction
term. Protein names and interaction terms (e.g.,
activate, bind, inhibit) are provided as a "dictionary."
Pustejovsky et al. [9] extracted an "inhibit" relation for
the gene entity from MEDLINE. Jenssen et al. [10]
extracted gene-gene relations based on co-occurrence of
the form "... g1 ... g2 ..." within a MEDLINE abstract,
where g1 and g2 are gene names. Gene names were provided
as a "dictionary", harvested from HUGO, LocusLink,
and other sources. Although their study uses
13,712 named human genes and millions of MEDLINE
abstracts, no extensive quantitative results are reported
and analyzed. Friedman et al. [11] extracted a pathway
relation for various biological entities from a variety of
articles.

The third approach is based on machine learning 
techniques. Bunescu et al. [1] conducted protein/protein 
interaction identification with several learning methods 
such as pattern matching rule induction (RAPIER), 
boosted wrapper induction (BWI), and extraction using 
longest common subsequences (ELCS). ELCS automatically
learns rules for extracting protein interactions
using a bottom-up approach. They conducted experiments
in two ways; one with manually crafted protein
names and the other with the extracted protein names 
by their name identification method. In both experi- 
ments, Zhou et al. [12] proposed two novel semi-super- 
vised learning approaches, one based on classification 
and the other based on expectation-maximization, to 
train the HVS model from both annotated and un-annotated
corpora. Song et al. [13] utilized syntactic as well
as semantic cues of input sentences. By combining the
text chunking technique and Mixture Hidden Markov
Models, they took advantage of sentence structures and
patterns embedded in plain English sentences. Temkin
and Gilder [14] used a full parser with a lexical analyzer 
and a context free grammar (CFG) to extract protein- 
protein interaction from text. Alternatively, Yakushiji et 
al. [15] propose a system based on head-driven phrase 
structure grammar (HPSG). In their system protein 
interaction expressions are presented as predicate argu- 
ment structure patterns from the HPSG parser. These 
parsing approaches consider only syntactic properties of 
the sentences and do not take into account semantic 
properties. Thus, although they are complicated and 
require many resources, their performance is not satis- 
factory. Mitsumori et al. [16] used SVM to extract pro- 
tein-protein interactions. They use bag-of-words 
features, specifically the words around the protein 
names. These systems do not use any syntactic or 
semantic information. Miyao et al. [17] conducted a 
comparative evaluation of several state-of-the-art natural 
language parsers, focusing on the task of extracting pro- 
tein-protein interaction (PPI) from biomedical papers. 
They found marginal difference in terms of accuracy but 
more significant differences in parsing speed. BioPPISV- 
MExtractor is a recent PPI extraction system developed 
with SVM [2]. It utilizes rich feature sets such as word 
features, keyword feature, protein names distance fea- 
ture, and Link Grammar extraction results for protein- 
protein interaction extraction. They observed that the 
rich feature sets help improve recall at the cost of a 
moderate decline in precision. 

Cui et al. [18] applied an uncertainty sampling based 
method of active learning for a lexical feature-based 
SVM model to tag the most informative unlabeled sam- 
ples. They reported that the performance of the active 
learning-based technique on AIMED and CB corpora 
was significantly improved in terms of reduction of 
labelling cost. 

Methods 

In this section, we describe the overall architecture and 
procedures of PPISpotter (Figure 1). PPISpotter incorporates
AL models into SSL SVMs for extraction of protein-protein
interaction. PPISpotter also automatically
converts a sentence into 9 feature sets based on the
technique described in Section 4.

Figure 1 System architecture of PPISpotter

Below is a set of steps that PPISpotter processes. 

Step 1: Preprocess the initial training data. The fea- 
ture selector applies the feature selection technique pro- 
posed in Section 4 to the preprocessed data sets. 

Step 2: Train the model. Two classifiers, the Break Tie-based
SVM (BT-SVM) and the Deterministic Annealing-based
SVM (DA-SVM), are combined to train
the model (a.k.a. BTDA-SVM). Figure 2 illustrates how
these two techniques are combined (the blue dotted line is the
BT-SVM procedure and the red solid line is the DA-SVM
procedure). At this stage, the human expert provides
feedback to the system for a set of instances in the
fuzzy unlabeled data. Note that the BT-SVM classifier is
based on the Break Tie active learning approach and the
DA-SVM classifier is based on the Deterministic
Annealing technique.

Figure 2 Combination of active learning with semi-supervised learning

Step 3: Take the input data and convert it to the same 
format as the training data. The feature selector per- 
forms the same task as in Step 1. 

Step 4: Apply the BTDA-SVM learner to identify sen- 
tences that contain protein-protein interaction. 

Step 5: Store extracted sentences to the database. 

Combination of active learning with semi-supervised 
learning 

One of the goals of this paper is to combine SSL and 
AL into a unified semi-supervised active learning techni- 
que for protein-protein interaction extraction. We 
employ a proportion of unlabeled data in the learning 
tasks in order to resolve the problem of insufficient 
training data. 

Our strategy of combining AL with SSL is inspired by
the study of Tur et al. [5]. We employ the break tie AL
technique (BT-SVM) to train a classifier on both labeled
and unlabeled data, and return to the user the most
relevant results. Then, the learning system trains a classifier
based on the Deterministic Annealing SSL technique
(DA-SVM) on both the labeled and unlabeled data
(S_t, S_k, and S_u), which yields the final model (Figure 2).

BTDA-SVM is a combination of the active learning 
algorithm presented in Section 4 and the semi-super- 
vised learning algorithm presented in Section 5. Instead 
of leaving out the instances classified with high confi- 
dence scores, this algorithm exploits them. Figure 3 
explains the BTDA-SVM algorithm. 

Active learning 

Active learning, known as pool-based active learning, is 
an interactive learning technique designed to reduce the 
labor cost of labeling in which the learning algorithm 
can freely assign the unlabeled data instances to the 
training set. The basic idea is to select the most infor- 
mative data instances for labeling by the users in the 
next learning round. In other words, the strategy of 
active learning is to select an optimal set of unlabeled 
data instances that minimizes the expected risk of the 
next round. 
Breaking tie (BT) 

For a given instance, a regular SVM yields a distance
from the hyperplane whose range is from 0 to 1. A
value of 0 means that the instance lies on the hyperplane,
and a value of 1 indicates that the instance is a support
vector.

Figure 3 BTDA-SVM algorithm

To assign a probability value to a class, the sigmoid
function can be used with the assumption that a probability
associated with a classifier indicates to which
extent the classification result is trusted. In this case,
Luo et al. [19] define the parametric model in the following
form:

$$P(y = 1 \mid f) = \frac{1}{1 + \exp(Af + B)} \qquad (1)$$



where A and B are scalar values, which have to be
estimated, and f is the decision function of the SVMs.
This parametric model is used for calculating the probabilities.
To use this model, the SVM parameters (complexity
parameter C, kernel parameter k) and the
parameters A and B need to be calculated. Although
cross validation can be used for this calculation, it is
computationally expensive. An alternative is a pragmatic
approximation in which all binary SVMs share the
same A, B is eliminated by assigning 0.5 to instances
lying on the decision boundary, and the SVM parameters
and A are computed simultaneously [19].
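
As a minimal sketch of Eq. 1, the function below maps an SVM decision value f to a probability. The function name and the example values of A are our own illustrative assumptions; in practice A (and B, if kept) would be estimated from data. With B = 0, an instance on the decision boundary (f = 0) receives probability 0.5, as described above.

```python
import math

def sigmoid_probability(f, A, B=0.0):
    """P(y = 1 | f) = 1 / (1 + exp(A * f + B)), following Eq. 1.

    f is the SVM decision value; A and B are scalars that would be
    estimated in practice (the values used below are illustrative only).
    """
    z = A * f + B
    # Branch on the sign of z so that exp() never overflows.
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

# With B = 0, an instance on the hyperplane gets probability 0.5.
print(sigmoid_probability(0.0, A=-2.0))
# With a negative A, a larger decision value gives a higher probability.
print(sigmoid_probability(1.5, A=-2.0) > sigmoid_probability(0.5, A=-2.0))
```

Note that in this parameterization a negative A is needed for the probability to increase with the decision value.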



The decision function can be normalized by its margin
to include the margin in the calculation of the probabilities:

$$P_{pq}(y = 1 \mid f) = \frac{1}{1 + \exp\left(\frac{A f}{\lVert w \rVert}\right)} \qquad (2)$$



where we currently look at class p and P_pq is the probability
of class p versus class q. We assume that the P_pq,
q = 1, 2, ..., are independent. The final probability for class
p is:

$$P_p = \prod_{q \neq p} P_{pq}(y = 1 \mid f) \qquad (3)$$



It has been reported that the method based on
this approximation is fast and accurate [19]. This probability
model serves as the basis for the Breaking Tie
algorithm for semi-supervised learning.
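
The following sketch shows how the class probabilities of Eq. 3 could drive instance selection. The text above does not spell out the selection rule, so this follows our reading of the breaking-ties idea in [19] (query the instances whose two largest class probabilities are closest); treat that rule, the function names, and the toy numbers as assumptions.

```python
from math import prod

def class_probabilities(pairwise):
    """Combine pairwise probabilities as in Eq. 3.

    pairwise[p][q] holds P_pq for every q != p; the result maps each
    class p to the product of P_pq over all q != p.
    """
    classes = list(pairwise)
    return {p: prod(pairwise[p][q] for q in classes if q != p) for p in classes}

def breaking_ties(pool_probs, k):
    """Return indices of the k pool instances with the smallest gap
    between their two highest class probabilities (assumed criterion)."""
    def gap(probs):
        top = sorted(probs.values(), reverse=True)
        return top[0] - top[1]
    order = sorted(range(len(pool_probs)), key=lambda i: gap(pool_probs[i]))
    return order[:k]

# Hypothetical unlabeled pool with two classes.
pool = [
    {"interaction": 0.90, "none": 0.10},
    {"interaction": 0.55, "none": 0.45},
    {"interaction": 0.30, "none": 0.70},
]
print(breaking_ties(pool, k=2))
```

The selected instances are the ones an expert would be asked to label in the next round.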

Semi-supervised support vector machines 

Support Vector Machines (SVMs) is a supervised 
machine learning approach designed for solving two- 
class pattern recognition problems. SVMs adopts maxi- 
mum margin to find the decision surface that separates 
the positive and negative labeled training examples of a 
class [20]. 

Transductive Support Vector Machines (TSVMs) are an
extended version of SVMs that use unlabeled data in
addition to labeled data to train classifiers [21]. The goal
of TSVMs is to determine which labeling of the test data instances
results in the maximum-margin hyperplane that separates
the positive and negative examples. Since
every test instance needs to be included in the SVM's
objective function, finding the exact solution to the
resulting optimization problem is intractable. To resolve
this issue, Joachims [21] proposed an approximation
algorithm. One issue of Joachims' approach, however, is
that it requires a similar distribution of positive and
negative instances in the test data and the training
data. This requirement is difficult to meet, particularly
when the training data is small. The challenge is to find a
decision surface that separates, with maximum margin, the
positive and negative instances of both the original labeled
data and the unlabeled data once the latter have been assigned
labels. The unlabeled data push the
decision boundary away from the dense regions, and the optimization
problem is NP-hard [22]. Various approximation
algorithms are found in [22].

The optimization problem in TSVMs is a non-linear,
non-convex optimization problem [23]. Over the past several years,
researchers have attempted to solve this critical
problem. Chapelle and Zien [24] proposed a smooth loss
function and a gradient descent method to find the
decision boundary in a region of low density. Another
technique is a branch-and-bound method [25] that
searches for the optimal solution. However, it is applicable only to
a small number of examples because of its heavy
computational cost. Despite the success of TSVMs, the
unlabeled data does not necessarily improve classification
accuracy.

As an alternative to TSVMs, we explore a Deterministic
Annealing approach to semi-supervised SVMs.
The first approach was proposed by Luo and his collea- 
gues [19] that formulated a probabilistic framework for 
image recognition. The Deterministic Annealing (DA) 
approach is the second proposed by Sindhwani et al. 
[26]. In the probabilistic framework, semi-supervised 
learning can be modeled as a missing data problem, 
which can be addressed by generative models such as 
mixture models. In the case of semi-supervised learning, 
probabilistic approaches provide us with various differ- 
ent ways to query unlabeled instances for labeling. A 
simple method is to train a model on the given labeled 
datasets and use this model on the unlabeled data. Each 
of these unlabeled instances is given probabilities that 
these instances belong to a given class. We can query 
the least certain instances or the most certain instances. 
A detailed description of Deterministic Annealing
semi-supervised learning is provided in the study of
Luo and his colleagues [19].
Deterministic annealing (DA) 

Deterministic annealing (DA) is a special case of a
homotopy method for combinatorial optimization problems
[26]. We adopt the DA technique proposed by
Sindhwani et al. [26] for the extraction of protein-protein
interaction. A detailed description of applying DA to
SVMs is provided by Sindhwani et al. [26].

Suppose one is given the following non-convex optimization
problem:

$$y^* = \arg\min_{y \in \{0,1\}^n} F(y) \qquad (4)$$

DA finds a local minimum of this problem as follows.
First, DA treats the discrete variables as random binary
variables over a space of probability distributions P. Second,
to solve the optimization problem, DA finds a distribution
p ∈ P that minimizes the expected value of F.
This makes the optimization problem continuous.
To this end, an additional convex term, based on
the entropy S of the distribution, is added to the
objective function, as shown in Eq. 5.

$$p^* = \arg\min_{p \in P} E_p(F(y)) - T\,S(p) \qquad (5)$$



where the parameter T controls the trade-off between
the expectation and the entropy (called the temperature
of the problem) and y ∈ {0,1}^n are the discrete variables
for the objective function F(y). For T = 0 and P including
all point-mass distributions over {0,1}^n, the global
minimizer p* in Eq. 5 will place all of its mass on the
global minimizer of F. However, if T ≫ 0, the entropy
term in Eq. 5 dominates the objective function. With
convexity, we can solve a sequence of problems for
values of T_0 > T_1 > ... > T_∞ = 0, where each of them is
initialized at the solution obtained by the previous one.
This sequence of temperatures is called the annealing
schedule. When T is close to zero, the influence of the
entropy term diminishes. Therefore, the distribution
becomes more concentrated on the minimum of
E_p[F], which allows us to identify the discrete variables y
by p. Note that there is no guarantee of global optimality
because there is not always a path connecting the
local minimizers for the chosen sequence of T to the
global optimum of F.
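
To illustrate the role of the temperature, the sketch below anneals a deliberately simple, hypothetical objective F(y) = sum_i c_i y_i with independent Bernoulli variables p_i = P(y_i = 1). This toy case is our own assumption, chosen because the minimizer of E_p[F] - T S(p) has a closed form; it is convex at every temperature, so the warm start used in real DA is not needed here. It only shows how p moves from nearly uniform at high T to a point mass on the minimizer of F as T approaches 0.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def entropy(p):
    """Entropy S(p) of independent Bernoulli variables with P(y_i = 1) = p_i."""
    total = 0.0
    for pi in p:
        for q in (pi, 1.0 - pi):
            if q > 0.0:
                total -= q * math.log(q)
    return total

def anneal(costs, temperatures):
    """Minimise E_p[F] - T * S(p) for F(y) = sum_i costs[i] * y_i at each T.

    Returns one (T, p, objective value) tuple per temperature.
    """
    trajectory = []
    for T in temperatures:
        # Setting the derivative c_i - T * log((1 - p_i) / p_i) to zero
        # gives the closed-form minimiser p_i = 1 / (1 + exp(c_i / T)).
        p = [sigmoid(-c / T) for c in costs]
        expected_f = sum(c * pi for c, pi in zip(costs, p))
        trajectory.append((T, p, expected_f - T * entropy(p)))
    return trajectory

costs = [-1.0, 2.0, -0.5]            # hypothetical toy objective
schedule = [10.0, 1.0, 0.1, 0.01]    # decreasing temperatures, all > 0
for T, p, value in anneal(costs, schedule):
    print(T, [round(pi, 3) for pi in p], round(value, 3))
```

At the last temperature, rounding p recovers the labeling that minimizes F, namely y_i = 1 exactly where c_i is negative.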
Applying DA to SVMs 

Given a binary classification problem, we consider a set
of L training pairs $\mathcal{L} = \{(x_1, y_1), \ldots, (x_L, y_L)\}$, $x \in \mathbb{R}^n$, $y \in \{1, -1\}$,
and an unlabeled set of U test vectors $\mathcal{U} = \{x_{L+1}, \ldots, x_{L+U}\}$.
SVMs have a decision function $f_\theta(\cdot)$ of the form
$f_\theta(x) = w \cdot \Phi(x) + b$, where $\theta = (w, b)$ are the parameters
of the model, and $\Phi(\cdot)$ is the chosen feature map, often
implemented implicitly using the kernel trick. Given a
training set $\mathcal{L}$ and a test set $\mathcal{U}$, the TSVM optimization
problem is to find among the possible binary vectors
$\{y = (y_{L+1}, \ldots, y_{L+U})\}$ the one such that an SVM trained on
$\mathcal{L} \cup (\mathcal{U} \times y)$ yields the largest margin. This combinatorial
problem can be approximated as finding an SVM separating
the training set under constraints which force the
unlabeled examples to be as far as possible from the
margin. This can be written as minimizing

$$\frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{L} \xi_i + C^* \sum_{i=L+1}^{L+U} \xi_i$$

subject to $y_i f_\theta(x_i) \ge 1 - \xi_i$, $i = 1, \ldots, L$, and
$|f_\theta(x_i)| \ge 1 - \xi_i$, $i = L+1, \ldots, L+U$. This minimization
problem is equivalent to minimizing

$$w^* = \min \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{L} H_1(y_i f_\theta(x_i)) + C^* \sum_{i=L+1}^{L+U} H_1(|f_\theta(x_i)|) \qquad (6)$$

where the function $H_1(\cdot) = \max(0, 1 - \cdot)$ is the classical
Hinge Loss function. In other words, TSVM seeks a
hyperplane w and a labeling of the unlabeled examples,
so that the SVM objective function is minimized. The
discussion in Deterministic Annealing motivates a continuous
objective function,

$$\mathcal{J}_T(f, p) = E_p\,\mathcal{J}(f, y') - T\,S(p) \qquad (7)$$

that is defined by taking the expectation of the objective $\mathcal{J}(f, y')$ (Eq. 6)
with respect to a distribution p on y' and including the
entropy of p as a homotopy term.

The solution to the optimization problem
above is tracked as the temperature parameter T
is lowered to 0. The DA algorithm returns the solution
corresponding to the minimum value achieved when 
some stopping criterion is satisfied. The criterion used 
in the DA algorithm is the Pair-wise Mutual Informa- 
tion (PMI) between values of p in consecutive iterations. 
The parameter T is decreased in an outer loop until the 
total entropy falls below a threshold. 
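
As a rough sketch of the objective in Eq. 6, the function below evaluates the regularizer, the hinge loss on labeled examples, and the hinge loss on the absolute decision value of unlabeled examples. It assumes a linear feature map (Φ(x) = x) and plain Python lists; the function names and the toy numbers are illustrative assumptions, not the implementation used in PPISpotter.

```python
def hinge(t):
    """Classical hinge loss H_1(t) = max(0, 1 - t)."""
    return max(0.0, 1.0 - t)

def tsvm_objective(w, b, labeled, unlabeled, C, C_star):
    """Value of the Eq. 6 objective for a linear decision function.

    labeled   : list of (x, y) pairs with y in {1, -1}
    unlabeled : list of feature vectors x
    """
    def f(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    regularizer = 0.5 * sum(wi * wi for wi in w)
    labeled_loss = sum(hinge(y * f(x)) for x, y in labeled)
    # Unlabeled points are penalized for lying inside the margin,
    # whichever side of the hyperplane they fall on.
    unlabeled_loss = sum(hinge(abs(f(x))) for x in unlabeled)
    return regularizer + C * labeled_loss + C_star * unlabeled_loss

# Hypothetical two-dimensional example.
labeled = [([2.0, 0.0], 1), ([-0.5, 0.0], -1)]
unlabeled = [[0.25, 0.0]]
print(tsvm_objective([1.0, 0.0], 0.0, labeled, unlabeled, C=1.0, C_star=0.5))
```

The unlabeled term is what makes the problem non-convex: it is small on both sides of the margin and peaks at the decision boundary.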

Feature selection 

Rich feature sets improve accuracy of the PPI extraction
task [27]. The features used in Yang's paper include word
features, keyword features, protein name distance features,
and link path features. In this paper, we incorporate
syntactic and lexical features
as well as semantic features, such as negated sentence features
and the interactor and its POS tag, into the feature
sets. A total of 9 features were selected for our semi-supervised
learning technique (see Table 1).

Negation: We include whether a sentence is negated
or not in the feature set. We use NegEx, developed by
Chapman and colleagues [28], for negation detection. NegEx is a
regular expression-based approach that defines a fairly
extensive list of negation phrases that appear before or
after a finding. NegEx treats a finding as
negated if a negation phrase appears within n words
of it. The output of NegEx is the negation status
assigned to each of the UMLS terms identified in the
sentence: negated, possible or actual. NegEx uses the
following regular expressions triggered by three types of
negation phrases:

<pre-UMLS negation phrase> {0-5 tokens} <UMLS term>
<UMLS term> {0-5 tokens} <post-UMLS negation phrase>

Table 1 Features extracted from example sentence A

Feature                             Feature Value
Is negated sentence                 True
No. of protein occurrences          3
Interactor name                     response
Interactor POS                      NN
Interactor position                 88
No. of words in between proteins    24
No. of left words                   -1
No. of right words                  12
Link path status                    Yes




There are three types of negation phrases in these
expressions: 1) pre-UMLS, 2) post-UMLS and 3) pseudo
negation phrases. Pre-UMLS phrases appear before the
term they negate, while the post-UMLS phrases appear
after the term they negate. Pseudo negation phrases are
similar to negation phrases but are not reliable indicators
of negation; they are used to limit the negation
scope. All UMLS terms inside the 0-5 token window
are assigned the negation status depending on the nature
of the negation phrase: negated or possible. An
example of a negated sentence processed with NegEx
is as follows:

[PREN]No[PREN] relevant changes in heart rate ,
body weight , and plasma levels of [NEGATED]renin
[NEGATED] activity and aldosterone concentration
were observed -> negated
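
The two window patterns can be sketched with ordinary regular expressions. This is a simplified illustration and not the NegEx implementation: the trigger lists below are tiny hypothetical stand-ins for the real phrase lists, pseudo negation phrases and the "possible" status are ignored, and the term is passed in directly instead of being looked up in UMLS.

```python
import re

# Hypothetical trigger phrases; the real NegEx lists are far more extensive.
PRE_TRIGGERS = ("no", "without", "absence of")
POST_TRIGGERS = ("was ruled out", "was not observed")
WINDOW = 5  # at most five tokens between the trigger and the term

def _alternation(phrases):
    # Longest phrases first so that multi-word triggers win over prefixes.
    return "|".join(re.escape(p) for p in sorted(phrases, key=len, reverse=True))

def is_negated(sentence, term):
    """True if `term` lies within the token window of a negation trigger."""
    gap = r"(?:\s+\S+){0,%d}?\s+" % WINDOW
    t = re.escape(term)
    pre = r"\b(?:%s)%s%s\b" % (_alternation(PRE_TRIGGERS), gap, t)
    post = r"\b%s%s(?:%s)\b" % (t, gap, _alternation(POST_TRIGGERS))
    return bool(re.search(pre, sentence, re.IGNORECASE)
                or re.search(post, sentence, re.IGNORECASE))

print(is_negated("The assay showed no interaction with RAD51", "RAD51"))
print(is_negated("RAD51 binds BRCA2 in vivo", "RAD51"))
```

A term that is more than five tokens away from the trigger is left untouched by this sketch, which is the behavior the {0-5 tokens} window describes.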

Number of protein named entity (NE) occurrences:
We extracted protein names from each sentence
by using a Conditional Random Field (CRF)-based
Named Entity Recognition (NER) technique.

To train the CRF NER, we used the training data provided
for the BioCreative II Gene Mention Tagging task.
The training data consist of 20,000 sentences. Approximately
44,500 GENE and ALTGENE annotations were
converted to the MedTag database format [29]. Once
we built the trained model, we applied the CRF NER to
extract proteins or genes from a sentence and counted
the number of occurrences of genes in the sentence.

Interactor: An interactor is a term that indicates an
interaction among proteins in a sentence. A total of 220
interactor terms were identified. We applied a modified
UEA stemmer to handle term variations of interactors
[30]. We did not apply an aggressive stemmer such as the
Porter stemmer since we wanted to preserve the POS
tag of the interactor.

Interactor POS: As for protein named entities, we 
applied the CRF-based POS tagging technique to tag 
tokenized words in a sentence. The CRF-based POS tag- 
ger was built on top of the MALLET package [31]. 

Interactor position: We included the position of the 
interactor term in a sentence in the feature set. 

Number of words in between proteins: We included
the number of words between the leftmost Protein NE and the
rightmost Protein NE in the feature set.

Number of left words: We included the number of
words to the left of the first appearance of a Protein
NE in the feature set.

Number of right words: We included the number of
words to the right of the last appearance of a Protein
NE in the feature set.
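
A minimal sketch of how the three count features could be computed from a tokenized sentence is given below. The whitespace tokenization, the 0-based token indices, the dictionary keys, and the use of None for sentences without any protein are all our own assumptions; the paper does not specify these conventions.

```python
def positional_features(tokens, protein_idx):
    """Count-based features from token positions of protein named entities.

    tokens      : list of tokens in the sentence
    protein_idx : 0-based indices of tokens tagged as protein NEs
    """
    if not protein_idx:
        # No protein in the sentence: the counts are undefined here.
        return {"n_proteins": 0, "between": None, "left": None, "right": None}
    first, last = min(protein_idx), max(protein_idx)
    return {
        "n_proteins": len(protein_idx),
        # Tokens strictly between the leftmost and rightmost protein NE
        # (any protein tokens in between are counted as words too).
        "between": max(last - first - 1, 0),
        # Tokens to the left of the first protein NE.
        "left": first,
        # Tokens to the right of the last protein NE.
        "right": len(tokens) - 1 - last,
    }

tokens = "We found that RAD51 binds BRCA2 in vivo".split()
print(positional_features(tokens, [3, 5]))
```

In a full pipeline the protein indices would come from the CRF-based NER step described earlier.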

Link path status: This feature is obtained with Link
Grammar, which was introduced by Lafferty et al. [32].
Link Grammar is used to connect pairs of words in a
sentence with various links. Each word is linked with
connectors. A link consists of a left-pointing connector
connected with a right-pointing connector of the same
type on another word. A sentence is validated if all the
words are connected. We assume that if a link path
between two protein names exists, these two proteins
have an interaction relation. In our feature selection, if a
link path between two protein names exists, the feature is set to
"Yes"; otherwise, it is set to "No". The Link Grammar parser was
used in several papers to extract protein-protein interaction
[27,2].

Results 

Data sets 

One of the issues in protein-protein interaction extraction
is that different studies use different data sets and
evaluation metrics. This makes it difficult to compare the
results reported by these studies.

In this paper, we used three different datasets that
have been widely used in protein-protein interaction
tasks. These are 1) the AIMED corpus, 2) the BioCreAtIvE2
corpus, which is provided as a resource by the BioCreAtIvE II
(Critical Assessment for Information Extraction
in Biology) challenge evaluation, and 3) the BioInfer corpus.
Table 2 summarizes the characteristics of these three
datasets.

AIMED: Bunescu et al. [1] manually developed the
AIMED corpus for protein-protein interaction and protein
name recognition. They tagged 199 Medline
abstracts, obtained from the Database of Interacting
Proteins (DIP) and known to contain protein interactions.
This corpus is becoming a standard, as it has
been used in several recent studies
[1,15,16].

BioCreAtIvE2: This is a corpus for protein-protein interactions
that originated from the BioCreAtIvE task 1A data set
for named entity recognition of gene/protein names. We
randomly selected 1000 sentences from this set and
added annotations for interactions between
genes/proteins. Of these, 173 sentences contain at least one
interaction and 589 sentences contain at least one gene/protein.
There are 255 interactions, some of which include more
than two partners (e.g., one partner occurs with full
name and abbreviated) [33].

BioInfer: stands for Bio Information Extraction
Resource. It was developed by Pyysalo et al. [34]. The
corpus contains 1100 sentences from PubMed abstracts
annotated for relationships, named entities, as well as
syntactic dependencies.

Table 2 Data sets used for experiments

Data Set        Total Sentences    Positive Sentences    Negative Sentences
AIMED           4026               951                   3075
BioCreative2    4056               2202                  1854
BioInfer        1100               573                   527

Since previous studies that used these datasets per- 
formed 10-fold cross-validation, we also performed 10- 
fold cross-validation in these datasets and reported the 
average results over the runs. 

As evaluation metrics, we use precision, recall,
F-score, and AUC to compare the performance
of the methods.
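
For reference, the first three metrics can be computed from confusion counts as sketched below. We assume the balanced F-score (harmonic mean of precision and recall), since the text does not state a weighting; the function name and the example counts are illustrative.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and balanced F-score from confusion counts.

    tp: sentences correctly predicted as containing an interaction
    fp: sentences wrongly predicted as containing an interaction
    fn: interaction sentences that were missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score

# Hypothetical counts for one cross-validation fold.
print(precision_recall_f(tp=8, fp=2, fn=8))
```

When scores are averaged over cross-validation folds, the mean of the per-fold F-scores need not equal the F-score computed from the mean precision and recall.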

Comparison techniques 

In this section, we briefly describe other techniques 
incorporated into semi-supervised SVMs and used to 
evaluate the performance of active semi-supervised 
learning models adopted in PPISpotter. 

Baseline: random sampling (RS-SVM) 

Random sampling of the unlabeled instances is a naive
approach to semi-supervised learning. We use this
approach as a baseline for the other semi-supervised
learning approaches, as several previous studies have done
[19,35].

Clustering (C-SVM) 

One technique is to apply a clustering algorithm to the
unlabeled data. Fung and Mangasarian [19] used
k-median clustering and showed that the performance was
competitive with supervised learning. The
downside of a clustering approach is that the correct number
of clusters needs to be pre-defined. We initially tried
two clustering techniques, K-means and Kernel K-means,
and found that there was only a marginal difference
in terms of performance. Therefore, we use
K-means for the performance comparison.

Supervised SVMs (SVM) 

The baseline supervised SVM model uses a linear kernel.
One of the advantages of supervised SVMs with a linear
kernel is that they can handle high dimensional data
effectively, because they compare only the "active" features
rather than the complete dimensions. This way, we can
impose richer feature sets upon each training example to
enhance system performance. The richer feature sets have been
shown to be more effective than the simple feature sets [2].
Another advantage of the linear kernel SVM is its low training
and testing time costs. In addition, with a linear kernel SVM
only the penalty parameter C needs to be adjusted,
which is usually set as a constant in applications. In our
experiments, the SVM-light package was used. The penalty
parameter C is an important parameter since it controls
the tradeoff between the training error and the margin.
The SVM-light package provides a sensible default value for this
parameter. In our experiments the parameter was left at its
default value since we observed that other manually
determined values of this parameter in fact led to worse
performance of supervised SVMs when compared with
the default one.

Discussion 

We evaluate and compare the performance of the active
semi-supervised machine learning approach (BTDA-SVM)
in several different ways. First, we compare it
with three different techniques: random sampling,
K-means clustering, and supervised SVMs. In addition, we
test the performance of BTDA-SVM against its supervised
counterpart (SVMs) as well as an active learning technique
(BT-SVM) for the task of protein-protein interaction
extraction. Second, we examine whether the size of the
combined training datasets of unlabeled and
labeled data has an impact on the performance. As
discussed in Section 3, we use Break Tie and Deterministic
Annealing as a kernel function in BTDA-SVM.

Table 3 shows the results obtained with the AIMED data
set. Our approach (BTDA-SVM) performs considerably better
than the other techniques in terms of precision, recall,
and F-measure. BTDA-SVM's performance is superior to the
regular SVMs approach by 34.79% in terms of precision. It
is 25.55% better than the Random Sampling approach
(RS-SVM) in terms of recall. In terms of F-measure,
BTDA-SVM is 28.6% better than the regular SVMs. The Break
Tie approach (BT-SVM) is the second best on all three
measures.
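
The precision and F-measure gains over the regular SVMs read as
relative improvements over the baseline score. A small sketch of that
arithmetic, using the Table 3 values; the function name
relative_improvement is ours and the relative interpretation is an
assumption.

```python
def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline`, relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0

# Table 3 values (AIMED): precision and F-score, in percent.
svm_precision, btda_precision = 55.15, 74.34
svm_f, btda_f = 48.14, 61.91

precision_gain = relative_improvement(btda_precision, svm_precision)  # about 34.8
f_gain = relative_improvement(btda_f, svm_f)                          # about 28.6
```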

We conducted individual t-tests essentially as specific
comparisons. Our prediction that BTDA-SVM would be better
than the other comparison techniques (BT-SVM, SVM, RS-SVM,
and C-SVM) was confirmed, t(11) = 3.6966E-11, p<0.05
(one-tailed) at n-1 degrees of freedom (12 runs), when
comparing with C-SVM, which performed best among the other
comparison techniques. Similarly, the t-test confirmed that
the performance difference between BT-SVM and C-SVM is
statistically significant, t(11) = 0.0169, p<0.05
(one-tailed).
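
For readers who want to reproduce this kind of comparison, the sketch
below computes a paired t statistic over per-run scores of two
techniques. It assumes the test is paired across the 12 runs, which
the text does not state explicitly; the function name is hypothetical.
The constant is the standard tabulated one-tailed critical value for
11 degrees of freedom at the 0.05 level.

```python
import math
import statistics

# Tabulated one-tailed critical value of Student's t, df = 11, alpha = 0.05.
T_CRIT_ONE_TAILED_05_DF11 = 1.796

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-run scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least 2 runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

With 12 runs, technique A would be judged better than technique B at
the 0.05 level (one-tailed) when t exceeds the critical value above.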



Table 3 Experimental results - AIMED data set

Algorithms              Measures
                        Precision   Recall    F-score
SVM                     55.15%      42.47%    48.14%
RS-SVM                  56.98%      41.71%    48.92%
C-SVM                   64.53%      40.42%    50.67%
BT-SVM                  65.23%      42.51%    53.64%
BTDA-SVM                74.34%      50.75%    61.91%
Yakushiji et al. [15]   33.70%      33.10%    33.40%
Mitsumori et al. [16]   54.20%      42.60%    47.70%


In Table 3, we also show the results obtained pre- 
viously in the literature by using the same data set. 
Yakushiji et al. [15] used an HPSG parser to produce 
predicate argument structures. They utilized these struc- 
tures to automatically construct protein interaction 
extraction rules. Mitsumori et al. [16] used SVMs with 
the unparsed text around the protein names as features 
to extract protein interaction sentences. 

Semi-supervised approaches are usually claimed to be 
more effective when there is less labeled data than unla- 
beled data, which is usually the case in real applications. 
To see the effect of semi-supervised approaches we per- 
form experiments by varying the amount of labeled 
training sentences in the range [10, 3000]. For each 
labeled training set size, sentences are selected randomly 
among all the sentences, and the remaining sentences 
are used as the unlabeled test set. The results that we 
report are the averages over 10 such random runs for 
each labeled training set size. We report the results for 
the algorithms when edit distance based similarity is 
used, as it mostly performs better than cosine similarity. 
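
The sampling protocol described above can be sketched as follows.
This is an illustration under our own assumptions: the evaluate
callback stands in for training a classifier on the labeled part and
scoring it on the remainder, and all names are hypothetical.

```python
import random

def labeled_unlabeled_split(sentences, n_labeled, rng):
    """Randomly pick n_labeled sentences; the rest form the unlabeled test set."""
    if not 0 < n_labeled < len(sentences):
        raise ValueError("n_labeled must be between 1 and len(sentences) - 1")
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled[:n_labeled], shuffled[n_labeled:]

def average_over_runs(sentences, n_labeled, evaluate, runs=10, seed=0):
    """Average evaluate(labeled, unlabeled) over `runs` random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        labeled, unlabeled = labeled_unlabeled_split(sentences, n_labeled, rng)
        scores.append(evaluate(labeled, unlabeled))
    return sum(scores) / len(scores)
```

Calling average_over_runs once per labeled set size in the range
[10, 3000] would yield one averaged point per size on a learning
curve.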

Figures 4 and 5 show the performance differences of the
five SVM-based learning techniques as the size of the
training data increases. BTDA-SVM performs considerably
better than its supervised counterpart SVM, as well as
RS-SVM and C-SVM, when only a small number of labeled
training instances is available. It is interesting to note
that, although SVM is one of the best performing
algorithms with more training data, it is the worst
performing algorithm with a small amount of labeled
training sentences. Its performance starts to increase
when the number of training instances is larger than 200.
Eventually, its performance gets close to that of the
other algorithms. Harmonic function is the best performing
algorithm when we have fewer than 200 labeled training
instances.

BTDA-SVM consistently outperforms the other techniques in
this experiment. We observed that most of the techniques
improved significantly once the training data reached 200
training instances. Compared to the other techniques, the
performance of BTDA-SVM did not change radically with the
size of the training data.

Figure 4 The F-score on the AIMED dataset with varying sizes of training data

Figure 5 The F-score on the BioCreative II PPI dataset with varying sizes of training data

Table 4 shows the experimental results with the
BioCreative2 PPI data set. BTDA-SVM performs better than
the other techniques on all three measures. BTDA-SVM
outperforms the regular SVMs (SVM) by 22.34%, 86.13%, and
48.89% in terms of precision, recall, and F-measure,
respectively. BT-SVM achieves the second best performance
on all three measures.

In the t-test, we predicted that BTDA-SVM would be better
than the other three comparison techniques (SVM, RS-SVM,
and C-SVM), and the prediction was confirmed, t(11) =
0.0312, p<0.05 (one-tailed) at n-1 degrees of freedom (12
runs), when comparing with C-SVM. However, our prediction
that BT-SVM would be better than C-SVM was not confirmed
at the p<0.05 level (one-tailed), t(11) = 0.092. As shown
in Figure 5, the performance curves differ from those for
the AIMED data set. The performance of SVM and RS-SVM is
consistently inferior to that of C-SVM, BT-SVM, and
BTDA-SVM.

Although BTDA-SVM consistently outperforms the other
techniques in this experiment, the difference is not
statistically significant (in the t-test, t(6) = 0.2124,
p<0.05 (one-tailed) at n-1 degrees of freedom). In
addition, the performance of none of the techniques
changed radically with the size of the training data.

We also report the performance of the five comparison
techniques with the BioInfer data set. Table 5 shows the
experimental results in terms of precision, recall, and
AUC. BTDA-SVM's performance is the best among the five
techniques. It is better than the regular SVMs approach
by 25.23%, 19.41%, and 10.32% in terms of precision,
recall, and AUC, respectively. With respect to AUC, the
results of the t-test indicate that BTDA-SVM's
performance is statistically significantly better than
that of the other three comparison techniques (SVM,
RS-SVM, and C-SVM), t(11) = 8.3483E-6, p<0.05
(one-tailed) at n-1 degrees of freedom (12 runs), when
comparing with C-SVM, which performed best among those
three techniques. In the same vein, our prediction that
BT-SVM would be better than C-SVM was confirmed, t(11) =
0.00025, p<0.05 (one-tailed).
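
For reference, AUC can be computed directly from classifier scores as
the fraction of (positive, negative) pairs that are ranked correctly,
with ties counted as half. The sketch below is a generic illustration
of that definition and is not taken from the PPISpotter code; the
function name and the 0/1 label encoding are assumptions.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking (labels are 0 or 1)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of examples, which is
acceptable for a sketch; a rank-based formulation would scale better
on large corpora.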

Conclusions 

The goal of our study is two-fold. The first is to explore
integrating an active learning technique with
semi-supervised SVMs to improve the performance of
classifiers. The second is to propose rich, comprehensive
feature sets for protein-protein interaction extraction.
To this end, we presented an active semi-supervised
SVM-based PPI extraction system, PPISpotter, which
encompasses the entire procedure of PPI extraction from
the biomedical literature: protein name recognition, rich
feature selection, and PPI extraction. In the PPI
extraction stage, besides several common features such as
word features and keyword features, some new useful
features, including the protein name distance feature,
phrase negation, and the link path feature, were
introduced for the supervised learning problem. We
combined an active learning technique, Break Tie
(BT-SVM), with the Deterministic Annealing-based
semi-supervised learning technique (DA-SVM); the
combination (BTDA-SVM) serves as the core algorithm of the
PPISpotter system. This BTDA-SVM technique, compared with
four different techniques
including an active learning technique (BT-SVM), was
tested on three widely used PPI corpora. The experimental
results indicated that our technique, BTDA-SVM, achieves
statistically significant improvement over the other three
techniques in terms of precision, recall, F-measure, and
AUC.

In future work, we plan to further explore the
characteristics of active learning approaches to
semi-supervised SVMs and refine our approach to achieve a
better PPI extraction performance.

Table 4 Experimental results - BioCreative2 PPI data set

Algorithms              Measures
                        Precision   Recall    F-score
SVM                     70.23%      51.21%    58.33%
RS-SVM                  71.7%       56.54%    62.5%
C-SVM                   78.23%      88.68%    83.65%
BT-SVM                  81.75%      93.5%     85.96%
BTDA-SVM                85.92%      95.32%    86.85%
TSVM-edit [36]          85.62%      84.89%    85.22%

Table 5 Comparison results - BioInfer data set

Algorithms              Measures
                        Precision   Recall    AUC
SVM                     65.89%      54.6%     0.843
RS-SVM                  64.5%       55.2%     0.847
C-SVM                   70.24%      60.2%     0.86
BT-SVM                  79.29%      63.1%     0.918
BTDA-SVM                82.52%      65.2%     0.93
Graph Kernel [1]        47.7%       59.9%     0.849